% Benjamin Haibe-Kains
%
% 
% February 22 201

\documentclass[a4paper,11pt]{article}

\usepackage{helvet}
\usepackage[helvet]{sfmath}
\renewcommand{\familydefault}{\sfdefault}
\usepackage[american]{babel}
\usepackage{url}
\usepackage{hyperref}
\usepackage{a4wide}
%\usepackage{colortbl}
%\usepackage{longtable}
%\usepackage{multirow}
%\usepackage{rotating}
\usepackage{natbib}
\usepackage{graphicx}
\usepackage{authblk}
%additional packages
%\usepackage{minitoc}
%\usepackage{amsmath}
%\usepackage{bm}
%\usepackage{algorithm}
%\usepackage{algpseudocode}
\usepackage{wasysym}
\usepackage{tikz}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\title{\vspace{-3cm}\textbf{PQG short course}\\
\textit{Predictive Networks}: a new framework for inferring robust networks from gene expression data}
\author[1,2]{Benjamin Haibe-Kains}
\affil[1]{Computational Biology and Functional Genomics Laboratory, Dana-Farber Cancer Institute, Harvard School of Public Health}
\affil[2]{Center for Cancer Computational Biology, Dana-Farber Cancer Institute}
\date{February 22, 2011}

\begin{document}

\SweaveOpts{highlight=TRUE, tidy=TRUE, keep.space=TRUE, keep.blank.space=FALSE, keep.comment=TRUE}

<<setup,echo=FALSE>>=
library(pgfSweave)
setCacheDir("cache")
options(keep.source=TRUE)
@

<<initseed,echo=FALSE,results=hide,eval=TRUE>>=
set.seed(123)
@

\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\tableofcontents

\vfill
\section*{Disclaimer}

This work was supervised by John Quackenbush and it has been done in collaboration with Christopher Bouton, Erik Bakke, James Hardwick, Walter Gillett, Catharina Olsen, Gianluca Bontempi, Amira Djebbari, Niall Prendergast, and Renee Rubio.

\section*{Materials}

All the materials required for this course are available from \url{http://dvn.iq.harvard.edu/dvn/dv/netinf}.

\clearpage
\section{Introduction}

DNA microarrays and other high-throughput omics technologies provide large datasets that often include hidden patterns of correlation between genes reflecting the complex processes that underlie cellular processes. The challenge in analyzing large-scale expression data has been to extract biologically meaningful inferences regarding these processes \textendash \ often represented as networks \textendash \ in an environment where the datasets are complex and noisy. Although many techniques have been developed in an attempt to address these issues, to date their ability to extract meaningful and predictive network relationships has been limited. 
In this PQG course\footnote{\url{http://www.hsph.harvard.edu/pqg/events/tutorials-short-courses.html}}, I will introduce a platform developed in John Quackenbush's lab, which enables inference of reliable gene interaction networks from prior biological knowledge, in the form of biomedical literature and structured databases, and from gene expression data. Using real data, I will show the benefit of using prior biological knowledge to infer networks and how to quantitatively assess the quality of such networks.

\paragraph*{Tutorial}

This tutorial describes in detail all the necessary steps to infer a gene interaction network combining prior biological knowledge and gene expression data.

\medskip
\noindent Our approach is the following:
\begin{enumerate}
\item Select a gene expression dataset and a set of genes of interest.
\item Extract priors from the biomedical literature and public structured databases using the \textit{Predictive Networks} web application.
\begin{enumerate}
\item Read the priors and put them in a form that is convenient for further analyses.
\end{enumerate}
\item Use the \Rpackage{predictionet} R package to infer a gene interaction network from priors and gene expression data.
\begin{enumerate}
\item Infer a network using the main function \Rfunction{netinf}.
\item Explore and display the topology of the resulting network.
\item Assess the stability of the network inference in cross-validation (function \Rfunction{netinf.cv}).
\item Assess quantitatively the predictive ability of the network in cross-validation (function \Rfunction{netinf.cv}).
\item Assess quantitatively the predictive ability of the network in a fully independent dataset (functions \Rfunction{netinf.predict} and \Rfunction{pred.score}).
\end{enumerate}
\item Use \Rpackage{predictionet} to statistically compare multiple gene interaction networks.
\end{enumerate}

Using a breast cancer gene expression dataset as an example (see Section~\ref{sec:dataset}) we go step by step and provide all the R code necessary to perform the entire analysis.


\clearpage
\section{Target molecular pathway and gene expression data}\label{sec:dataset}

For this tutorial, we focus on breast cancer and more specifically the PIK3CA pathway.

\subsection{PIK3CA pathway}

Deregulated PI3K-AKT signaling has been shown to be implicated in many aspects of carcinogenesis. Many genomic alterations have been found to activate this pathway, which can contribute to tumor progression, metastases and resistance to treatment. In particular, somatic mutations of the PIK3CA gene, encoding the p110$\alpha$ catalytic subunit of PI3K, have been shown to activate AKT and induce oncogenic transformation in vitro and in vivo. Mutations of the PIK3CA gene have been found in 18\% to 40\% of human breast cancers, making them one of the most common genetic aberrations in breast cancer.

\citeauthor{loi2010pik3ca} identified a list of genes being differentially expressed between breast tumors carrying the PIK3CA mutation in exon 20 and those with the wild-type PIK3CA gene \citep{loi2010pik3ca}. This gene list is provided in \texttt{files/loi2010\_pik3cags\_signature\_278.csv} and will serve as the core set of genes involved in the PIK3CA pathway.

\subsection{Breast cancer gene expression data}

We use two large gene expression datasets of primary breast tumors collected before any adjuvant therapy. The first dataset, provided in \texttt{files/marray\_breast1/minn2007\_frma.RData}, was published in \citep{wang2005geneexpression,minn2007lung} and consists of 344 breast tumors hybridized on the Affymetrix GeneChip HG-U133A, composed of 22,283 probesets. The second dataset, provided in \texttt{files/marray\_breast2/\-transbig2006affy\_frma.RData}, was published in \citep{desmedt2007strong} and consists of 198 breast tumors hybridized on the same Affymetrix GeneChip (HG-U133A). The raw data have been collected from GEO\footnote{\url{http://www.ncbi.nlm.nih.gov/geo/}}, series accession numbers GSE2034/GSE5327 and GSE7390 for the first and the second dataset respectively, and normalized using \Rfunction{frma}\footnote{Do you know the benefits of using \Rfunction{frma} over \Rfunction{rma}?}.

\medskip
\noindent Each \texttt{RData} file contains three R objects:\\
\begin{description}
\item[\Robject{data} ] matrix of gene expression data, tumors in rows, probes in columns.
\item[\Robject{annot} ] data frame of probe annotations, probes in rows, annotations in columns.
\item[\Robject{demo} ] data frame of clinical information of the breast cancer patients, patients in rows, clinical variables in columns.
\end{description}

\noindent Let's now run R in the working directory. We have to format the datasets for further analysis, using the following commands:

<<preparedata,echo=TRUE,results=hide,cache=TRUE,eval=TRUE>>=
## read the gene list with the corresponding annotations
pik3ca.list <- read.csv("files/loi2010_pik3cags_signature_278.csv", stringsAsFactors=FALSE)
rownames(pik3ca.list) <- pik3ca.list[ , "probe"]
## remove special characters in the gene symbols
nn <- gsub(pattern="[-]|[+]|[*]|[%]|[$]|[#]|[{]|[}]|[[]|[]]|[|]|[\\^]", replacement=".", x=pik3ca.list[ ,"Gene.symbol"])
pik3ca.list <- cbind(pik3ca.list, "Gene.symbol.curated"=toupper(nn))

## remove unknown and duplicated genes by respecting the order of the list
pik3ca.list <- pik3ca.list[is.na(pik3ca.list[ ,"Gene.symbol.curated"]) | !duplicated(pik3ca.list[ ,"Gene.symbol.curated"]), , drop=FALSE]
## select only the top genes
genen <- 20
if(genen < 2 || genen > nrow(pik3ca.list)) { genen <- nrow(pik3ca.list) }
pik3ca.list <- pik3ca.list[1:genen, , drop=FALSE]

## load the first dataset
load("files/marray_breast1/minn2007_frma.RData")
data1 <- data[ ,rownames(pik3ca.list), drop=FALSE]
annot1 <- annot[rownames(pik3ca.list), , drop=FALSE]
colnames(data1) <- rownames(annot1) <- pik3ca.list[ ,"Gene.symbol.curated"]
demo1 <- demo
## save the data in a CSV file
write.csv(t(data1), file="files/marray_breast1_pik3ca.csv")
## free memory
rm(list=c("data", "data.bin", "annot", "demo"))
gc()

## load the second dataset
load("files/marray_breast2/transbig2006affy_frma.RData")
data2 <- data[ ,rownames(pik3ca.list), drop=FALSE]
annot2 <- annot[rownames(pik3ca.list), , drop=FALSE]
colnames(data2) <- rownames(annot2) <- pik3ca.list[ ,"Gene.symbol.curated"]
demo2 <- demo
## save the data in a CSV file
write.csv(t(data2), file="files/marray_breast2_pik3ca.csv")
## free memory
rm(list=c("data", "data.bin", "annot", "demo"))
gc()
@

\noindent So the objects \Robject{data1}, \Robject{annot1}, \Robject{demo1} and \Robject{data2}, \Robject{annot2}, \Robject{demo2} contain the information for the first and second dataset respectively, the probe names being replaced by their corresponding gene symbols. The object \Robject{pik3ca.list} contains the annotations of the unique genes related to PIK3CA.


\clearpage
\section{Priors from the biomedical literature and public structured databases}\label{sec:priors}

In order to extract previously published gene interactions, Prof John Quackenbush initiated the development of the \textit{Predictive Networks} web application [\url{https://compbio.dfci.harvard.edu/predictivenetworks/}] implemented by Dr Christopher Bouton and his team at \textsc{Entagen}\footnote{http://www.entagen.com}. This tool enables users to easily retrieve a large number of high quality gene interactions reported in the biomedical literature (full-text open-access PubMed articles and/or MEDLINE abstracts) and/or structured biological databases (e.g., Pathway Commons) by focussing on a core set of genes.

In our case we use the \textit{Predictive Networks} web application to retrieve a list of interactions involving at least one gene included in our list of PIK3CA-related genes.To do so we first have to create an account to login in the webapp. Then go to "My Page" and create a new gene list by cutting-and-pasting the gene symbols\footnote{Be extremely careful if you use Microsoft Excel since it will automatically interpret SEPT6, that is the gene "SEPTIN 6", as the 6th of September!} included in \texttt{files/loi2010\_pik3cags\_signature\_278.csv}. Lastly, we have to export all the resulting triples (e.g., "$gene_a$ regulates $gene_b$") in a CSV file by clicking on "View Triples" and then "Download w/ Sentences As". Name this CSV file as \texttt{priors\_pik3ca.csv} in the \texttt{files/} directory.

Let's now read the newly generated priors into our R session.

<<preparepriors,echo=TRUE,results=hide,eval=TRUE>>=
## read priors generated by the Predictive Networks web application
pn.priors <- read.csv("files/priors_pik3ca.csv", stringsAsFactors=FALSE)
## the column names should be: subject, predicate, object, direction, evidence, sentence, article, network

## remove special characters in the gene symbols
pn.priors[ ,"subject"] <- gsub(pattern="[-]|[+]|[*]|[%]|[$]|[#]|[{]|[}]|[[]|[]]|[|]|[\\^]", replacement=".", x=pn.priors[ ,"subject"])
pn.priors[ ,"object"] <- gsub(pattern="[-]|[+]|[*]|[%]|[$]|[#]|[{]|[}]|[[]|[]]|[|]|[\\^]", replacement=".", x=pn.priors[ ,"object"])

## missing values
pn.priors[!is.na(pn.priors) & (pn.priors == "" | pn.priors == " " | pn.priors == "N/A")] <- NA

## select only the interactions in which the genes are comprised in our gene expression dataset
myx <- is.element(pn.priors[ , "subject"], pik3ca.list[ ,"Gene.symbol.curated"]) & is.element(pn.priors[ , "object"], pik3ca.list[ ,"Gene.symbol.curated"])
pn.priors <- pn.priors[myx, , drop=FALSE]
@

\noindent The R object \Robject{pn.priors} is a matrix that contains the triples $gene_a \rightarrow gene_b$ that have been reported in the literature and the structured biological databases.

<<tablepriors,echo=TRUE,results=verbatim,eval=TRUE>>=
## print the first rows of 'pn.priors'
print(head(pn.priors))
@

\noindent As we will see later it is convenient to transform this long list of triples in a square matrix of counts containing the number of times an interaction $gene_a \rightarrow gene_b$ has been reported in the literature and the structured biological databases.

<<preparepriorscount,echo=TRUE,results=hide,eval=TRUE>>=
## build prior counts
priors.count <- matrix(0, nrow=nrow(pik3ca.list), ncol=nrow(pik3ca.list), dimnames=list(pik3ca.list[ ,"Gene.symbol.curated"], pik3ca.list[ ,"Gene.symbol.curated"]))

for(i in 1:nrow(pn.priors)) {
	switch(tolower(pn.priors[i, "direction"]), 
	"right"={ priors.count[pn.priors[i, "subject"], pn.priors[i, "object"]] <- priors.count[pn.priors[i, "subject"], pn.priors[i, "object"]] + ifelse(!is.na(pn.priors[i, "evidence"]), ifelse(tolower(pn.priors[i, "evidence"]) == "positive", +1, -1), 0) }, 
	"left"={ priors.count[pn.priors[i, "object"], pn.priors[i, "subject"]] <- priors.count[pn.priors[i, "object"], pn.priors[i, "subject"]] +  ifelse(!is.na(pn.priors[i, "evidence"]), ifelse(tolower(pn.priors[i, "evidence"]) == "positive", +1, -1), 0) }, 
	{ priors.count[pn.priors[i, "subject"], pn.priors[i, "object"]] <- priors.count[pn.priors[i, "subject"], pn.priors[i, "object"]] +  ifelse(!is.na(pn.priors[i, "evidence"]), ifelse(tolower(pn.priors[i, "evidence"]) == "positive", +1, -1), 0)
	if(pn.priors[i, "object"] != pn.priors[i, "subject"]) { priors.count[pn.priors[i, "object"], pn.priors[i, "subject"]] <- priors.count[pn.priors[i, "object"], pn.priors[i, "subject"]] +  ifelse(!is.na(pn.priors[i, "evidence"]), ifelse(tolower(pn.priors[i, "evidence"]) == "positive", +1, -1), 0) } })
}
## negative count represent evidence for ABSENCE of an interaction, positive otherwise
@

As can be seen, few interactions have been previously reported; this is due to the fact that, to date, the genes related to PIK3CA published in \citep{loi2010pik3ca} have been only poorly characterized.

<<priorscounttab,echo=TRUE,results=verbatim,eval=TRUE>>=
print(table(priors.count))
@

\noindent Out of the \Sexpr{genen * genen} possible interactions, only \Sexpr{sum(priors.count != 0, na.rm=TRUE)} have been reported at least once.

\clearpage
\section{Network inference from priors and gene expression data}

Many methods have been developed for network inference: Bayesian networks, nested $q$-partial graphs, information-theoretic networks, regression-based networks,\dots These methods differ greatly in their way of dealing with the potentially high dimensionality of the problem, handling missing values, learning from observations and/or interventions, combining genomic data and prior knowledge, assessing the quality of fitted networks, and their ability to make predictions.

In this course, I present a regression-based network inference approach we specifically developed
\begin{itemize}
\item to deal efficiently with large number of genes (potentially $\geq 100$),
\item to avoid discretization of the input values,
\item to combine prior knowledge and gene expression data,
\item to infer stable gene interaction networks,
\item to make predictions about the expression of genes of interest in independent data.
\end{itemize}

\subsection{Methodology}

The design of our regression-based approach for network inference is illustrated in Figure~\ref{fig:regrnetdesign}. The extraction of prior knowledge about gene interactions and its transformation to obtain a directed graph have been explained in Section~\ref{sec:priors}. In order to infer a directed graph from gene expression data we first build an undirected graph using the \textit{maximum relevance minimum redundancy} (MRMR) feature selection technique \citep{ding2005minimum,meyer2007information} and we then infer causality \citep{cheng2002learning,olsen2009inferring}. The network topology is determined by combining the directed graphs inferred from priors and gene expression data. Using the resulting topology, we can fit regression models in order to assess the predictive ability of the network model.

\begin{figure}[htbp]
\centering
\includegraphics[keepaspectratio=true,height=10cm]{files/regrnet_design}
\caption{Design of the regression-based network inference approach.}
\label{fig:regrnetdesign}
\end{figure}


\subsubsection*{MRMR}

To infer an undirected graph from gene expression data we use the maximum relevance minimum redundancy (MRMR) feature selection technique \citep{ding2005minimum,meyer2007information}.  For this iterative forward selection technique, at each step a different gene is used as the target gene. The algorithm selects then the genes that exhibits the highest difference between mutual information with the target gene and redundancy with the previously selected genes. This difference acts as a score for the possible interactions. The interactions which are represented by a score lower than a given threshold will be deleted in order to infer the undirected graph. The idea is that direct interactions should receive a high score whereas indirect interactions a low one.
In the first step, the gene $X_i$ which has the highest mutual information to the target gene $Y$ is selected. The second selected variable $X_j$ will be the one with a high information $I(X_j,Y)$ to the target and at the same time a low information $I(X_j,X_i)$ to the previously selected variable.
In the next steps, given a set $X_\mathcal{S}$ of selected variables, the criterion updates $X_\mathcal{S}$ by choosing the next gene $X_q$ that maximizes the score
$$ s_q = I(X_q,Y) - \frac{1}{| \mathcal{S} |} \sum_{X_k \in X_\mathcal{S}}{I(X_q,X_k)} $$which can be described as a relevance term minus a redundancy term. 
%For each pair {Xi,Xj} the algorithm returns two scores si and sj and computes the maximum of the two. All edges with a score below a given threshold are then deleted.

The feature selection stop when the number of genes connected to the target is the maximum number of parents allowed by the user. The resulting undirected graph is locally acyclic since the target gene cannot be selected.

\subsubsection*{Causality inference}

Once the undirected graph is built, we have to perform the arc orientation. Based on previous works on causality (see \citep{neopalitan2003learning} for a review), we can actually infer causality by using the "explaining away effect" which states that a given common effect creates a dependency between two variables (this can be graphically represented by the $v$-structure, where the value of the collider is given). In  \citep{ding2005minimum, meyer2008information} this effects has been related to the property

$$ I(X,Z|Y) > I(X,Z) \Leftrightarrow C(X,Y,Z) < 0 $$

where $I(X,Z)$ is the mutual information of $X$ and $Z$, $I(X,Z|Y)$ is the mutual information of $X$ and $Z$ given $Y$, and $C(X,Y,Z)$ is the interaction information of $X$, $Y$ and $Z$ such that
\begin{eqnarray}
C(X,Y,Z) & = & I(x,Y) - I(X,Y|Z) \nonumber \\
 & = & I(X,Z) - I(X,Z|Y) \nonumber \\
 & = & I(Y,Z) - I(Y,Z|X) \nonumber
\end{eqnarray}


If the undirected (acyclic) graph is given, then for every triple of variables $X \relbar Y \relbar Z$ the property (5.6) corresponds to a v-structure $X \rightarrow Y \leftarrow Z$. On the contrary, knowing only that the interaction information is less than zero does not help with inferring the network because the three possible differences between mutual information and conditional mutual are equivalent. Thus, it is not evident which variable should be the collider.

Therefore, the only possibility for orienting the arcs using the interaction information is to use the following property:

\begin{quote}
Given three variables $X$, $Y$ and $Z$ and a structure $X \relbar Y \relbar Z$, then if $I(X,Z|Y) > I(X,Z)$, i.e., $C(X,Y,Z) < 0$, then $X \rightarrow Y \leftarrow Z$

%If the structure is given by $X \rightarrow Y \relbar Z$ and if $I(X,Z|Y) \leq I(X,Z)$, i.e., $C(X,Y,Z) \geq 0$, then 
%$$ X \rightarrow Y \rightarrow Z$$
\end{quote}

\noindent Depending on the data, we can now identify some of the genes as parents.

\subsubsection*{Combination with priors}

To infer the final network topology, we combine, for each interaction between $gene_i$ and $gene_j$, the score based on priors ($P_{ij}$) and MRMR+causality~inference ($M_{ij}$). We let users control the weight put on the priors, i.e., their confidence on the prior knowledge gathered on internet:
$$ w P{ij} + (1-w) M_{ij} $$
where $w \in [0,1]$.

Users can then easily infer networks using prior knowledge only ($w = 1$) or using gene expression data only ($w = 0$). However, we will see in our real case study that a combination of priors and gene expression data usually yields inference of better networks.

\subsubsection*{Predictive ability of the network model}

The last step of our network inference approach is the fitting of local regression models to enable prediction of gene expressions. To do so, we fit a penalized linear regression model \citep{park2007l1} for each gene in networks using its parents as predictors.


\bigskip
This novel approach is implemented in the \Rpackage{predictionet} R package and is also integrated in the \textit{Predictive Networks} web application (see "Analysis" panel). Even if the package is not yet available from BioConductor\footnote{\url{http://www.bioconductor.org}}, a public Git repository is accessible from \url{https://github.com/bhaibeka/predictionet}; any non-violent and constructive feedback is welcome. \smiley

First we should install the \Rpackage{predictionet} package using the following command

<<installpackage,echo=TRUE,results=hide,eval=FALSE>>=
install.packages(pkgs="predictionet_0.7.tar.gz", repos=NULL)
@

If some dependencies are missing, such as the \Rpackage{penalized} package, you can install them using the \Rfunction{install.packages} and specify the Comprehensive R Archive Network (CRAN) as repository

<<installpackagedep,echo=TRUE,results=hide,eval=FALSE>>=
install.packages("penalized", repos="http://cran.r-project.org", dependencies=TRUE)
@

If the package has to be installed on a Windows machine, the compilation may fail because you lack of a proper C compiler. Then we should use the R GUI and install the file \texttt{predictionet\_0.7.zip} that was precompiled specifically for Windows.

\medskip
\noindent Let's load the library into our R session.

<<loadpredictionet,echo=TRUE,results=hide,eval=TRUE>>=
library(predictionet)
@

\noindent We are now able to run all the functions implemented in the \Rpackage{predictionet} package. These function are listed in the ocumentation of the package itself.

<<foopredictionet,echo=TRUE,results=hide,eval=FALSE>>=
library(help=predictionet)
@


\subsection{Run your first network inference with \Rpackage{predictionet}}

Let's infer our first network and go through the main options of the package. The main function of the \Rpackage{predictionet} package is \Rfunction{netinf} with the following parameters:
.
\begin{description}
  \item[data ] Matrix of continuous or categorical values with observations in rows and features in columns. 
  \item[categories ] If this parameter is missing, \Robject{data} should be already discretized; otherwise the parameter should be either a single integer or a vector of integers specifying the number of categories used to discretize each variable (data are then discretized using equal-frequency bins) or a list of cutoffs to use in order to discretize each of the variables in \Robject{data} matrix. If method='bayesnet', this parameter should be specified by the user. 
  \item[perturbations ] Matrix of \{0, 1\} specifying whether a gene has been perturbed (e.g., knockdown, over-expression) in some experiments. Dimensions should be the same than \Robject{data}.
  \item[priors ] Matrix of prior information available for gene-gene interaction (parents in rows, children in columns). Values may be probabilities or any other weights (citations count for instance).
	\item[predn ] Indices or names of variables to fit during network inference. If missing, all the variables will be used for network inference.
\item[priors.count ] \Robject{TRUE} if priors specified by the user are number of citations (count) for each interaction, \Robject{FALSE} if probabilities or any other weight lying in $[0,1]$ are reported instead.
  \item[priors.weight ] Real value in [0, 1] specifying the weight to put on the priors (0=only the data are used, 1=only the priors are used to infer the topology of the network).
  \item[maxparents ] Maximum number of parents allowed for each gene.
  \item[subset ] Vector of indices to select only subset of the observations.
  \item[method ] 'bayesnet' for Bayesian network inference with the \Robject{catnet} package (not implemented yet), 'regrnet' for regression-based network inference. 
	\item[causal ] \Robject{TRUE} if the causality should be inferred from the data, \Robject{FALSE} otherwise 
	\item[seed ] Set the seed to make the network inference deterministic. 
\end{description}

\noindent You can easily access this description by consulting the help page of the \Rfunction{netinf}

<<helpnetinf,echo=TRUE,results=hide,eval=FALSE>>=
help(netinf)
@

\noindent You can infer a gene interaction network using the entire dataset \Robject{data1}, by equally balancing the importance of priors and data in the network inference process (\Robject{priors.weight} = 0.5).

<<firstnetinf,echo=TRUE,results=hide,cache=TRUE,eval=TRUE>>=
mynet <- netinf(data=data1, priors=priors.count, priors.count=TRUE, priors.weight=0.5, maxparents=3, method="regrnet", seed=54321)
@

\noindent NB: You may get some warning messages, this is due to the \Rpackage{penalized} package which is used internally; it will be fixed in a next release.

\medskip
\noindent Now let's display the topology of the network we just inferred (see Figure~\ref{fig:topoglobal}) using the \Rfunction{plot} function of the \Rpackage{network} package.

<<regrnetopofig,echo=TRUE,results=hide,eval=FALSE>>=
## network topology
mytopoglobal <- net2topo(net=mynet, coefficients=FALSE)
library(network)
xnet <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
mynetlayout <- plot.network(x=xnet, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5)
@

\begin{figure}[h!]
\centering
<<regrnetopofig,echo=FALSE,fig=TRUE,height=6,width=6,results=hide,eval=TRUE>>=
## network topology
mytopoglobal <- net2topo(net=mynet, coefficients=FALSE)
library(network)
xnet <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
mynetlayout <- plot.network(x=xnet, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.6)
@
\caption{Directed graph representing the topology of the network inferred from priors and gene expression data.}
\label{fig:topoglobal}
\end{figure}

\noindent Unfortunately, such kind of plots are not interactive and we cannot change the position of the genes as we would like. There exist other packages to display networks in R, some of them are interactive: \Rpackage{igraph}, \Rpackage{Graphviz},\dots

\medskip
\noindent A quick look at the topology in Figure~\ref{fig:topoglobal} allows us to identify PIK3R3 and RPL14 as highly connected genes (also referred to as \textit{hub} in the literature). By searching for these two genes in GeneSigDB\footnote{\url{http://compbio.dfci.harvard.edu/genesigdb/}}, a manually curated database of published gene signatures developed by Dr Aedin Culhane, we can see that PIK3R3 and RPL14, in addition to be included in the PIK3CA signature published in \citep{loi2010pik3ca}, are also part of other signatures published in breast, leukemia, ovarian and bladder cancers.

We could also highlight the interactions that were already known (\textbf{\textcolor{blue}{blue}}), the new interactions supported by the gene expression (\textbf{\textcolor{green}{green}}) and the interactions both known and supported by the data (\textbf{\textcolor{red}{red}}), see Figure~\ref{fig:topocol}.

<<edgecoldiff,echo=TRUE,results=hide,eval=TRUE>>=
## preparing colors
mycols <- c("blue", "green", "red")
names(mycols) <- c("prior", "data", "both")
myedgecols <- matrix("white", nrow=nrow(mytopoglobal), ncol=ncol(mytopoglobal), dimnames=dimnames(mytopoglobal))
## prior only
myedgecols[mytopoglobal == 0 & priors.count >= 1] <- mycols["prior"]
## data only
myedgecols[mytopoglobal == 1 & priors.count < 1] <- mycols["data"]
## both in priors and data
myedgecols[mytopoglobal == 1 & priors.count >= 1] <- mycols["both"]
@

<<edgecoldiffig,echo=TRUE,results=hide,eval=FALSE>>=
mytopoglobal2 <- (myedgecols != "white") + 0
## network topology
xnet2 <- network(x=mytopoglobal2, matrix.type="adjacency", directed=TRUE, loops=TRUE, vertex.attrnames=dimnames(mytopoglobal2)[[1]])
plot.network(x=xnet2, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=myedgecols, coord=mynetlayout)
@

\begin{figure}[h!]
\centering
<<edgecoldiffig,echo=FALSE,fig=TRUE,height=6,width=6,results=hide,eval=TRUE>>=
mytopoglobal2 <- (myedgecols != "white") + 0
## network topology
xnet2 <- network(x=mytopoglobal2, matrix.type="adjacency", directed=TRUE, loops=TRUE, vertex.attrnames=dimnames(mytopoglobal2)[[1]])
plot.network(x=xnet2, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=myedgecols, coord=mynetlayout)
@
\caption{Network topology where the edges are colored according to their source of evidence: \textbf{\textcolor{blue}{blue}} for edges supported by the priors only, \textbf{\textcolor{green}{green}} for edges supported by the data only, and \textbf{\textcolor{red}{red}} for edges supported by both the priors and the data.}
\label{fig:topocol}
\end{figure}

\noindent In Figure~\ref{fig:topocol} we can see that most of the interactions included in the priors have been also supported by the data (\textbf{\textcolor{red}{red}}) and have been inferred in the final network; only the self loops (\textbf{\textcolor{blue}{blue}}), which are not allowed by our regression-based network inference method, have been discarded. Many new interactions (\textbf{\textcolor{green}{green}}) have been inferred from the gene expression data.

\medskip
\noindent Although of interest, the topology does not tell us much about the quality of the network inference. Therefore we implemented two statistics to help us focus on the gene interactions that are well supported by the data:
\begin{itemize}
\item edge-specific stability.
\item gene-specific predictive ability.
\end{itemize}

\noindent The idea is to use a cross-validation procedure\footnote{\url{http://en.wikipedia.org/wiki/Cross-validation_(statistics)}} to infer multiple networks from different training datasets to assess both the stability of the inference and its predictive ability. The function \Rfunction{netinf.cv}, although computationally intensive, is doing all the work for us. The vast majority of the parameters are the same than for the \Rfunction{netinf} function, with some additions such as \Robject{nfold} that is the number of folds for the cross-validation procedure.

<<regrnetinfcv,echo=TRUE,results=hide,cache=TRUE,eval=TRUE>>=
myres.cv <- netinf.cv(data=data1, categories=3, priors=priors.count, maxparents=3, priors.weight=0.5, method="regrnet", nfold=10, seed=54321)
@

\noindent The object \Robject{myres.cv} contains a lot of information related to the network inference process:

<<rescv,echo=TRUE,results=verbatim,eval=TRUE>>=
print(str(myres.cv, 1))
@

\noindent with
\begin{description}
\item[net ] Model object of the network inferred using the entire dataset.
\item[net.cv ] List of models network models fitted at each fold of the cross-validation.
\item[topology ] Topology of the model inferred using the entire dataset.
\item[topology.cv ] Topology of the networks inferred at each fold of the cross-validation.
\item[prediction.score.cv ] List of prediction scores ($R^2$, $NRMSE$, $MCC$) computed at each fold of the cross-validation.
\item[edge.stability ] Stability of the edges inferred during cross-validation; only the stability of the edges present in the network inferred using the entire dataset is reported.
\item[edge.stability.cv ] Stability of the edges inferred during cross-validation.
\end{description}

\noindent We can now extract these statistics to better quantify the robustness of the network inference.


\subsection{Edge-specific stability}

At each fold of the cross-validation, a network is inferred using a slightly different dataset. This variation in the set of observations used to fit the network model induces some variation at the level of the inference. Some edges may be poorly supported by the data and therefore their inference strongly depends on the training dataset and is not generalizable. Since we performed a 10-fold cross-validation, we can calculate for each edge its presence frequency in the inferred networks, the most robust edge being inferred 10 times, the less robust ones only once or twice.

\medskip
\noindent Let's display the topology of the network inferred from the entire dataset where the edges are colored according to their stability (Figure~\ref{fig:topostab}).

<<edgecol,echo=TRUE,results=hide,eval=TRUE>>=
## preparing colors
ii <- 0:10
mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5)))
names(mycols) <- as.character(ii/10)
myedgecols <- matrix("#00000000", nrow=nrow(mytopoglobal), ncol=ncol(mytopoglobal), dimnames=dimnames(mytopoglobal))
for(k in 1:length(mycols)) { myedgecols[myres.cv$edge.stability == names(mycols)[k]] <- mycols[k] }
myedgecols[!mytopoglobal] <- "#00000000"
@

<<edgestab,echo=TRUE,results=hide,eval=FALSE>>=
def.par <- par(no.readonly=TRUE)
layout(rbind(1,2), heights=rbind(8,1))
par(mar=c(1,1,1,1))
## network topology
xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=myedgecols, coord=mynetlayout)
par(mar=c(0,3,1,3))
plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="Stability scale", cex.main=1)
rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey")
text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1)
par(def.par)
@

\begin{figure}[h!]
\centering
<<edgestabfig,echo=FALSE,fig=TRUE,height=6,width=6,results=hide,eval=TRUE>>=
def.par <- par(no.readonly=TRUE)
layout(rbind(1,2), heights=rbind(8,1))
par(mar=c(1,1,1,1))
## network topology
xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=myedgecols, coord=mynetlayout)
par(mar=c(0,3,1,3))
plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="Stability scale", cex.main=1)
rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey")
text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1)
par(def.par)
@
\caption{Network topology where the edges are colored according to their stability.}
\label{fig:topostab}
\end{figure}

\noindent As can be seen, the inference of half of the edges is very stable, especially around the two "hubs" that are RPL14 and PIK3R3. However the inference of the interactions with SPET6 is unstable.


\subsection{Gene-specific predictive ability}

Since our regression-based network inference approach actually fits local regression models to assess the dependence between the parent genes and the target/child gene, we can actually quantify the strength of this dependence by assessing the predictive ability of the network model for each individual gene. Several performance criteria have been implemented so far: 
\begin{itemize}
\item $R^2$: proportion of variance of the target/child gene explained by the regression model. The value lies in $ \{0, 1\}$, the larger, the better.
\item $NRMSE$: normalized root mean squared error of the regression model. The values are $ > 0$, the lower, the better.
%The value lies approximately in $ \{0, 1\}$, the lower, the better.
\item $MCC$: Matthews correlation coefficient (also called multi-class correlation). This is a performance criterion widely used in the classification framework so it requires first a discretization of the gene expression data in classes (this is done implicitly by the functions \Rfunction{netinf.cv} and \Rfunction{pred.score}). The value lies in $ \{0, 1\}$, the larger, the better.
\end{itemize}

\noindent Let's focus on $R^2$ and $MCC$. We can easily display the topology by coloring nodes according to the predictive ability of the network estimated by $R^2$ (Figure~\ref{fig:topopredabr2}) and $MCC$ (Figure~\ref{fig:topopredabmcc}). Here is the code for generating the plot reporting the $R^2$ estimations:

<<genecolr2,echo=TRUE,results=hide,eval=TRUE>>=
myr2 <- apply(myres.cv$prediction.score.cv$r2, 2, mean, na.rm=TRUE)
myr2 <- round(myr2*10)/10
## preparing colors
ii <- 0:10
mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5)))
names(mycols) <- as.character(ii/10)
myvertexcols <- mycols[match(myr2, names(mycols))]
names(myvertexcols) <- names(myr2)
@

<<edgestab,echo=TRUE,results=hide,eval=FALSE>>=
def.par <- par(no.readonly=TRUE)
layout(rbind(1,2), heights=rbind(8,1))
par(mar=c(1,1,1,1))
## network topology
xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout)
par(mar=c(0,3,1,3))
plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="R^2", cex.main=1)
rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey")
text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1)
par(def.par)
@

\begin{figure}[h!]
\centering
<<genepredabr2fig,echo=FALSE,fig=TRUE,height=6,width=6,results=hide,eval=TRUE>>=
def.par <- par(no.readonly=TRUE)
layout(rbind(1,2), heights=rbind(8,1))
par(mar=c(1,1,1,1))
## network topology
xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout)
par(mar=c(0,3,1,3))
plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$R^2$", cex.main=1)
rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey")
text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1)
par(def.par)
@
\caption{Network topology where the genes are colored according to the predictive performance of the network measure by $R^2$ in cross-validation.}
\label{fig:topopredabr2}
\end{figure}


\noindent A similar piece of code could be used to generate the plot reporting the $MCC$ estimations.

<<genecolmcc,echo=FALSE,results=hide,eval=TRUE>>=
mymcc <- apply(myres.cv$prediction.score.cv$mcc, 2, mean, na.rm=TRUE)
mymcc <- round(mymcc*20)/20
mymcc[mymcc < 0.5] <- 0.5
## preparing colors
ii <- 0:10
mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5)))
names(mycols) <- seq(0.5, 1, 0.05)
myvertexcols <- mycols[match(mymcc, names(mycols))]
names(myvertexcols) <- names(mymcc)
@

\begin{figure}[h!]
\centering
<<genepredabmccfig,echo=FALSE,fig=TRUE,height=6,width=6,results=hide,eval=TRUE>>=
def.par <- par(no.readonly=TRUE)
layout(rbind(1,2), heights=rbind(8,1))
par(mar=c(1,1,1,1))
## network topology
xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout)
par(mar=c(0,3,1,3))
plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$MCC$", cex.main=1)
rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey")
text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1)
par(def.par)
@
\caption{Network topology where the genes are colored according to the predictive performance of the network measured by $MCC$ in cross-validation.}
\label{fig:topopredabmcc}
\end{figure}

\medskip
\noindent As can be seen in Figures~\ref{fig:topopredabr2} and ~\ref{fig:topopredabmcc}, the genes SCGB2A2 and  SCGB1D2 are surprisingly well predicted by the network; in fact these two gene are highly correlated (Pearson's correlation coefficient of \Sexpr{sprintf("%.2g", cor(data1[ ,"SCGB1D2"], data1[ ,"SCGB2A2"], use="complete"))}). If we look at the rest of the genes we can see that the assessment of the network predictive ability is concordant with its stability analysis: the genes around RPL14 and PIK3R3 are quite well predicted while the genes connected with unstable edges are poorly predicted.


\paragraph{Validation}

Even though we used a proper cross-validation procedure to get an unbiased estimate of the predictive ability of the model, it is of interest to validate our network model in a fully independent dataset. The task is challenging because of the various biases that are inevitably present in a dataset generated from a different population of breast cancer patients and in a different lab. Although the resulting hidden batch effects might dramatically drive down the performance of a model, this additional validation step is necessary to assess the quality of the model in a real word situation where we cannot control for all the batch effects.

In this course, we use our second dataset of breast cancer patients to validate the performance of the network model inferred from the first dataset. we can actually compute the predictions and the corresponding $R^2$ and $MCC$ estimates using the following pice of code and then display the network using the same code than before (see Figures~\ref{fig:topopredabr2test} and \ref{fig:topopredabmcctest} for the $R^2$ and $MCC$ respectively).


<<validationpred,echo=TRUE,results=hide,eval=TRUE>>=
## compute predictions
mynet.test.pred <-  netinf.predict(net=mynet, data=data2)
## performance estimation: R2
mynet.test.r2 <- pred.score(data=data2, pred=mynet.test.pred, method="r2")
## performance estimation: MCC
mynet.test.mcc <- pred.score(data=data2, categories=3, pred=mynet.test.pred, method="mcc")
@


<<genecolr2test,echo=FALSE,results=hide,eval=TRUE>>=
mynet.test.r2 <- round(mynet.test.r2*10)/10
## preparing colors
ii <- 0:10
mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5)))
names(mycols) <- as.character(ii/10)
myvertexcols <- mycols[match(mynet.test.r2, names(mycols))]
names(myvertexcols) <- names(mynet.test.r2)
@

\begin{figure}[h!]
\centering
<<genepredabr2testfig,echo=FALSE,fig=TRUE,height=6,width=6,results=hide,eval=TRUE>>=
def.par <- par(no.readonly=TRUE)
layout(rbind(1,2), heights=rbind(8,1))
par(mar=c(1,1,1,1))
## network topology
xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout)
par(mar=c(0,3,1,3))
plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$R^2$", cex.main=1)
rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey")
text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1)
par(def.par)
@
\caption{Network topology where the genes are colored according to the predictive performance of the network measured by $R^2$ in a fully independent dataset.}
\label{fig:topopredabr2test}
\end{figure}


<<genecolmcctest,echo=FALSE,results=hide,eval=TRUE>>=
mynet.test.mcc <- round(mynet.test.mcc*20)/20
mynet.test.mcc[mynet.test.mcc < 0.5] <- 0.5
## preparing colors
ii <- 0:10
mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5)))
names(mycols) <- seq(0.5, 1, 0.05)
myvertexcols <- mycols[match(mynet.test.mcc, names(mycols))]
names(myvertexcols) <- names(mynet.test.mcc)
@

\begin{figure}[h!]
\centering
<<genepredabmcctestfig,echo=FALSE,fig=TRUE,height=6,width=6,results=hide,eval=TRUE>>=
def.par <- par(no.readonly=TRUE)
layout(rbind(1,2), heights=rbind(8,1))
par(mar=c(1,1,1,1))
## network topology
xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]])
plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout)
par(mar=c(0,3,1,3))
plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$MCC$", cex.main=1)
rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey")
text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1)
par(def.par)
@
\caption{Network topology where the genes are colored according to the predictive performance of the network measured by $MCC$ in a fully independent dataset.}
\label{fig:topopredabmcctest}
\end{figure}

\noindent As expected, the performance of our model slightly decreased but the vast majority of our observations remain true.

\clearpage
\section{Statistical comparison of network models}

Based on our approach to quantify the stability and predictive ability of a gene interaction network, we are now able to statistically compare the performance of two or more networks. This network model selection could help us optimize a parameter such as the weight of the priors during the network inference (\Robject{priors.weight}) or identify the best method of network inference (Bayesian vs regression-based network inference for instance).

\medskip
\noindent Let's infer and compare the predictive ability (estimated in cross-validation) of five networks with \Robject{priors.weight} = 0, 0.25, 0.5, 0.75 and 1.

<<regnetcvpriorsweight,echo=TRUE,results=hide,cache=TRUE,eval=TRUE>>=
## priors.weight 0
myres.cv.pw0 <- netinf.cv(data=data1, categories=3, priors=priors.count, maxparents=3, priors.weight=0, method="regrnet", nfold=10, seed=54321)
## priors.weight 0.25
myres.cv.pw025 <- netinf.cv(data=data1, categories=3, priors=priors.count, maxparents=3, priors.weight=0.25, method="regrnet", nfold=10, seed=54321)
## priors.weight 0.5
myres.cv.pw050 <- myres.cv
## priors.weight 0.75
myres.cv.pw075 <- netinf.cv(data=data1, categories=3, priors=priors.count, maxparents=3, priors.weight=0.75, method="regrnet", nfold=10, seed=54321)
## priors.weight 0
myres.cv.pw1 <- netinf.cv(data=data1, categories=3, priors=priors.count, maxparents=3, priors.weight=1, method="regrnet", nfold=10, seed=54321)
@

%\medskip
\noindent Now let's display the topology of the networks we just inferred by varying the weight we put on the priors (see Figure~\ref{fig:topoglobals}).

<<regnetcvpriorsweightfig,echo=TRUE,results=hide,eval=FALSE>>=
def.par <- par(no.readonly=TRUE)
layout(mat=matrix(1:4, nrow=2, ncol=2, byrow=TRUE))
## priors.weight 0
mytopot <- net2topo(net=myres.cv.pw0$net, coefficients=FALSE)
xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]])
plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0 (data only)")
## priors.weight 0.25
mytopot <- net2topo(net=myres.cv.pw025$net, coefficients=FALSE)
xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]])
plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0.25")
## priors.weight 0.75
mytopot <- net2topo(net=myres.cv.pw075$net, coefficients=FALSE)
xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]])
plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0.75")
## priors.weight 1
mytopot <- net2topo(net=myres.cv.pw1$net, coefficients=FALSE)
xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]])
plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 1 (priors only)")
par(def.par)
@

\begin{figure}[h!]
\centering
<<regnetcvpriorsweightfig2,echo=FALSE,fig=TRUE,height=7,width=7,results=hide,eval=TRUE>>=
def.par <- par(no.readonly=TRUE)
layout(mat=matrix(1:4, nrow=2, ncol=2, byrow=TRUE))
## priors.weight 0
mytopot <- net2topo(net=myres.cv.pw0$net, coefficients=FALSE)
xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]])
plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0 (data only)")
## priors.weight 0.25
mytopot <- net2topo(net=myres.cv.pw025$net, coefficients=FALSE)
xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]])
plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0.25")
## priors.weight 0.75
mytopot <- net2topo(net=myres.cv.pw075$net, coefficients=FALSE)
xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]])
plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0.75")
## priors.weight 1
mytopot <- net2topo(net=myres.cv.pw1$net, coefficients=FALSE)
xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]])
plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 1 (priors only)")
par(def.par)
@
\caption{Directed graph representing the topology of the networks inferred from priors and gene expression data with varying prior weight.}
\label{fig:topoglobals}
\end{figure}

\noindent As can be seen in Figure~\ref{fig:topoglobals}, the networks generated from data only (\Robject{priors.weight} = 0) and from priors only (\Robject{priors.weight} = 1) are very sparse, and the combination of priors and data enables the identification of more gene interactions. But a denser topology does not imply that all the inferred interactions are well supported by the data. In order to select the best network(s) we can statistically compare their stability and predictive ability.

\subsection{Comparison of edge-specific stability}

In order to compare the stability of the network inference, we can draw a boxplot representing the edge-specific stability of the network inference with respect to the weight put on the priors (Figure~\ref{fig:boxplotstabpw}). As expected, the network inferred from priors only is perfectly stable (the inference does not depend on the data anymore) but what is interesting is the important gain in stability when the priors are used as compared to networks inferred from data alone.

<<boxplotstabpwcvfig,echo=TRUE,results=hide,eval=FALSE>>=
gg <- c("0", "0.25", "0.50", "0.75", "1")
mystab.cv.pw <- list(myres.cv.pw0$edge.stability[myres.cv.pw0$topology == 1], myres.cv.pw025$edge.stability[myres.cv.pw025$topology == 1], myres.cv.pw050$edge.stability[myres.cv.pw050$topology == 1], myres.cv.pw075$edge.stability[myres.cv.pw075$topology == 1], myres.cv.pw1$edge.stability[myres.cv.pw1$topology == 1])
names(mystab.cv.pw) <- gg
boxplot(mystab.cv.pw, xlab="priors.weight", ylim=c(0, 1), ylab="Edge stability", border="grey", col="lightgrey")
points(x=jitter(x=rep(1:length(mystab.cv.pw), times=sapply(mystab.cv.pw, length)), amount=0.15), y=unlist(mystab.cv.pw), cex=0.55, pch=16, col="royalblue")
@

\begin{figure}[h!]
\centering
<<boxplotstabpwcvfig2,echo=FALSE,fig=TRUE,height=5,width=5,results=hide,eval=TRUE>>=
gg <- c("0", "0.25", "0.50", "0.75", "1")
mystab.cv.pw <- list(myres.cv.pw0$edge.stability[myres.cv.pw0$topology == 1], myres.cv.pw025$edge.stability[myres.cv.pw025$topology == 1], myres.cv.pw050$edge.stability[myres.cv.pw050$topology == 1], myres.cv.pw075$edge.stability[myres.cv.pw075$topology == 1], myres.cv.pw1$edge.stability[myres.cv.pw1$topology == 1])
names(mystab.cv.pw) <- gg
boxplot(mystab.cv.pw, xlab="priors.weight", ylim=c(0, 1), ylab="Edge stability", border="grey", col="lightgrey", outline=FALSE)
points(x=jitter(x=rep(1:length(mystab.cv.pw), times=sapply(mystab.cv.pw, length)), amount=0.15), y=unlist(mystab.cv.pw), cex=0.55, pch=16, col="royalblue")
@
\caption{Boxplot of the edge-specific stability computed in cross-validation, for network models with different prior weight.}
\label{fig:boxplotstabpw}
\end{figure}

\subsection{Comparison of gene-specific predictive ability}

In terms of predictive ability, the boxplot representing the gene-specific $R^2$ scores computed in cross-validation, for network models with different prior weights (Figure~\ref{fig:boxplotpw}), suggests that a combination of both data and priors yield better predictive ability.

<<boxplotr2pwcvfig,echo=TRUE,results=hide,eval=FALSE>>=
gg <- c("0", "0.25", "0.50", "0.75", "1")
myr2.cv.pw <- cbind(apply(myres.cv.pw0$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw025$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw050$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw075$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw1$prediction.score.cv$r2, 2, mean, na.rm=TRUE))
colnames(myr2.cv.pw) <- gg
gg <- factor(rep(gg, each=genen), levels=gg)
boxplot(myr2.cv.pw, xlab="priors.weight", ylim=c(0, 1), ylab="$R^2$", border="grey", col="lightgrey", outline=FALSE)
points(x=jitter(x=rep(1:ncol(myr2.cv.pw), times=nrow(myr2.cv.pw)), amount=0.15), y=as.vector(t(myr2.cv.pw)), cex=0.55, pch=16, col="royalblue")
@

\begin{figure}[h!]
\centering
<<boxplotr2pwcvfig2,echo=FALSE,fig=TRUE,height=5,width=5,results=hide,eval=TRUE>>=
gg <- c("0", "0.25", "0.50", "0.75", "1")
myr2.cv.pw <- cbind(apply(myres.cv.pw0$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw025$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw050$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw075$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw1$prediction.score.cv$r2, 2, mean, na.rm=TRUE))
colnames(myr2.cv.pw) <- gg
gg <- factor(rep(gg, each=genen), levels=gg)
boxplot(myr2.cv.pw, xlab="priors.weight", ylim=c(0, 1), ylab="$R^2$", border="grey", col="lightgrey", outline=FALSE)
points(x=jitter(x=rep(1:ncol(myr2.cv.pw), times=nrow(myr2.cv.pw)), amount=0.15), y=as.vector(t(myr2.cv.pw)), cex=0.55, pch=16, col="royalblue")
@
\caption{Boxplot of the gene-specific $R^2$ scores computed in cross-validation, for network models with different prior weight.}
\label{fig:boxplotpw}
\end{figure}

\noindent However a rigorous statistical comparison does not allow us to claim that a network model inferred from data+priors  yields always significantly better predictive ability than networks inferred from data or priors only. The Friedman test tells us that at least one of the network models yielded significantly different predictive ability; the pairwise Wilcoxon Rank Sum tests suggest that the model using the priors only (\Robject{priors.weight} = 1) is significantly less predictive than all the other network models; however the model inferred from data only does not yield statistically different predictive ability (although the p-value is close to significance)\footnote{An analysis involving more genes and better priors could show that network inferred from a combination of priors and data always lead to significantly more predictive models.}. 

<<regnetcvpriorsweightcompr2,echo=TRUE,results=verbatim,eval=TRUE>>=
## Friedman test to test whether at least one of the networks gives statically different predictive ability
print(friedman.test(y=myr2.cv.pw))

## Pairwise Wilcoxon Rank Sum tests
print(pairwise.wilcox.test(x=as.vector(myr2.cv.pw), g=gg, paired=TRUE, exact=FALSE, alternative="two.sided", p.adjust.method="bonferroni"))
@

\noindent Similar conclusions can be drawn by computing the $MCC$.

\paragraph{Validation}
The same analysis using the fully independent dataset to estimate the predictive ability of the different models supports the conclusions we drew from the cross-validation study in the training set.

<<boxplotr2pwtestfig2,echo=TRUE,results=hide,eval=TRUE>>=
gg <- c("0", "0.25", "0.50", "0.75", "1")
myr2.test.pw <- cbind(pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw0$net, data=data2), method="r2"), pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw025$net, data=data2), method="r2"), pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw050$net, data=data2), method="r2"), pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw075$net, data=data2), method="r2"), pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw1$net, data=data2), method="r2"))
colnames(myr2.test.pw) <- gg
gg <- factor(rep(gg, each=genen), levels=gg)
boxplot(myr2.test.pw, xlab="priors.weight", ylim=c(0, 1), ylab="$R^2$", border="grey", col="lightgrey", outline=FALSE)
points(x=jitter(x=rep(1:ncol(myr2.test.pw), times=nrow(myr2.test.pw)), amount=0.15), y=as.vector(t(myr2.test.pw)), cex=0.55, pch=16, col="royalblue")
@

\begin{figure}[h!]
\centering
<<boxplotr2pwtestfig2,echo=FALSE,fig=TRUE,height=5,width=5,results=hide,eval=TRUE>>=
gg <- c("0", "0.25", "0.50", "0.75", "1")
myr2.test.pw <- cbind(pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw0$net, data=data2), method="r2"), pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw025$net, data=data2), method="r2"), pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw050$net, data=data2), method="r2"), pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw075$net, data=data2), method="r2"), pred.score(data=data2, pred=netinf.predict(net=myres.cv.pw1$net, data=data2), method="r2"))
colnames(myr2.test.pw) <- gg
gg <- factor(rep(gg, each=genen), levels=gg)
boxplot(myr2.test.pw, xlab="priors.weight", ylim=c(0, 1), ylab="$R^2$", border="grey", col="lightgrey", outline=FALSE)
points(x=jitter(x=rep(1:ncol(myr2.test.pw), times=nrow(myr2.test.pw)), amount=0.15), y=as.vector(t(myr2.test.pw)), cex=0.55, pch=16, col="royalblue")
@
\caption{Boxplot of the gene-specific $R^2$ scores computed in the fully independent dataset, for network models with different prior weight.}
\label{fig:boxplotr2pwtestfig2}
\end{figure}

<<regnettestpriorsweightcompr2,echo=TRUE,results=verbatim,eval=TRUE>>=
## Friedman test to test whether at least one of the networks gives statically different predictive ability
print(friedman.test(y=myr2.test.pw))

## Pairwise Wilcoxon Rank Sum tests
print(pairwise.wilcox.test(x=as.vector(myr2.test.pw), g=gg, paired=TRUE, exact=FALSE, alternative="two.sided", p.adjust.method="bonferroni"))
@


\clearpage
\section{Future works}

The \textit{Predictive Networks} web application and the \Rfunction{predictionet} R package that we are using today are currently under active development and new features will be introduced progressively. Here is a list of the major features that will included soon:
\begin{itemize}
\item \textit{Predictive Networks} web application:
\begin{itemize}
\item Option to filter gene interactions according to their domain (e.g., breast cancer).
\item Implementation of an interactive visualization tool to allow users play with the inferred networks and identify the parts of the networks that are novel and well supported by the data.
\end{itemize}
\item \Rpackage{predictionet} R package:
\begin{itemize}
\item Implementation of a bayesian network inference method using the \Rpackage{catnet} R package.
\item Development of a novel approach to improve the completeness of the inferred network (and so get a more biological meaningful network) and its predictive ability. The idea behind this \textit{ensemble network inference} method is to generate many different networks that are equally well supported by the data and combine them in a consensus network instead of a single (local) optimum.
\item Allow users to export a Cytoscape\footnote{\url{http://www.cytoscape.org/}} file to visualize complex network in an efficient way.
\end{itemize}
\end{itemize}


\section*{Session Info}

<<sessioninfo,echo=FALSE,results=tex>>=
toLatex(sessionInfo(), locale=FALSE)
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\clearpage
\bibliographystyle{plainnat}
\bibliography{files/biblio}

\end{document}

